Anomalies are data points that differ from the other observations in some way, typically as measured against a model fit to the data. In contrast to ordinary descriptive statistics, our interest here is to find where these anomalous data points occur, not to exclude them as outliers.
We assume the anomaly detection task is unsupervised, i.e. we don’t have training data with points labeled as anomalous. Each data point passed to an anomaly detection model is given a score indicating how different the point is relative to the rest of the dataset. The calculation of this score varies between models, but a higher score always indicates a point is more anomalous. Often a threshold is chosen to make a final classification of each point as typical or anomalous; this post-processing step is left to the user.
The GraphLab Create (GLC) Anomaly Detection toolkit currently includes three models for two different data contexts: Local Outlier Factor for multivariate data, and Moving Z-Score and Bayesian Changepoints for sequential data and time series.
In this short note, we demonstrate how the Moving Z-Score and Bayesian Changepoints models can be used to reveal anomalies in a time series object. As an example we use the "Crude Oil Prices: Brent - Europe" time series, FRED-DCOILBRENTEU, as currently provided by the Quandl database of financial and economic data and the Federal Reserve Bank of St. Louis. This time series covers the daily closing prices of Crude Oil Brent - Europe (Dollars per Barrel, Not Seasonally Adjusted) from May 1987 to May 2016. The series is quite volatile across the years, and we hope to find out where the most anomalous spot values are. For notes and definitions, please see the corresponding U.S. Energy Information Administration (EIA) Explanatory Notes.
In the first step of our analysis, we use the GLC Moving Z-Score implementation. This unsupervised learning model fits a moving average to a univariate time series and, in that way, identifies points that are far from the fitted curve. The MovingZScoreModel works with either TimeSeries or SFrame inputs. A uniform sampling rate is assumed, and the data window must be defined in terms of number of observations.
The moving Z-score for a data point $x_{t}$ is simply the value of $x_{t}$ standardized by subtracting the moving mean just prior to time $t$ and dividing by the moving standard deviation computed over the same time interval. In particular, if $w$ stands for the window_size in terms of the number of observations, the moving Z-score is defined as:

\begin{equation*} z_{t} = \frac{x_{t}-\bar{x}_{t}}{s_{t}}, \end{equation*}

where the moving average is:
\begin{equation*} \bar{x}_{t} = (1/w)\,\sum_{i=t-w}^{t-1}x_{i}, \end{equation*}

and the standard deviation for the same time interval is:

\begin{equation*} s_{t} = \sqrt{(1/w)\,\sum_{i=t-w}^{t-1}(x_{i}-\bar{x}_{t})^{2}}. \end{equation*}

Notes:
The moving Z-score is not defined for points within the first window_size observations of a series, because there are insufficient points to compute the moving average and moving standard deviation. This is represented by missing values.

Missing values in the input dataset are assigned missing values ('None') for their anomaly scores as well.

If there is no variation in the values preceding a given observation, the moving Z-score can be infinite or undefined: if the given observation is equal to the moving average, the anomaly score is coded as 'nan'; if the observation is not equal to the moving average, the anomaly score is 'inf'. (Both cases are illustrated in the sketch below.)
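To make the definition and the notes above concrete, here is a minimal NumPy sketch (our own illustration, not GLC's implementation) that computes the signed moving Z-score exactly as defined; GLC reports its absolute value as the anomaly score.

import numpy as np

def moving_zscore(x, window_size):
    # Standardize x[t] by the mean and (population) standard deviation
    # of the window_size points strictly prior to t.
    x = np.asarray(x, dtype=float)
    scores = np.full(len(x), np.nan)     # first window_size points undefined
    for t in range(window_size, len(x)):
        window = x[t - window_size:t]
        mean, std = window.mean(), window.std()  # std uses 1/w, as above
        scores[t] = (x[t] - mean) / std  # nan if 0/0, inf if std == 0 only
    return scores

# A flat window equal to x[t] gives nan; a jump over a flat window gives inf.
print(moving_zscore([1, 1, 1, 1, 1, 1, 5, 1], window_size=5))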
As a next step of our analysis we are going to use the GLC Bayesian Changepoints model and compare the results of these two methods. The Bayesian Changepoints implementation scores changepoint probability in a univariate sequential dataset, often a time series. Changepoints are abrupt changes in the mean or variance of a time series. For instance, during an economic recession, stock values might suddenly drop to a very low value. The time at which the stock value dropped is called a changepoint.
The Bayesian Changepoints model is an implementation of the Bayesian Online Changepoint Detection algorithm developed by Ryan Adams and David MacKay. This algorithm computes a probability distribution over the possible run lengths at each point in the data, where run length refers to the number of observations since the last changepoint. When the probability of a 0-length run spikes, there is most likely a change point at the current data point.
More specifically, the algorithm follows the procedure below:
Step 1: Observe a new datum $x_{t}$ and evaluate the likelihood of seeing this value for each possible run length. The result is a probability vector, with one element per possible run length. A Gaussian distribution between each pair of changepoints is assumed.
\begin{equation*} L(r)= P(x_{t}\,|\,x_{r}) \end{equation*}

Step 2: For each possible run length $r>0$ at current time $t$, calculate the probability of growth, i.e. that the current run continues:

\begin{equation*} P_{t}(runlength\equiv r_{prev}+1)= P_{t-1}(runlength\equiv r_{prev})\ast L(r_{prev})\ast \left(1-\frac{1}{expected\_runlength}\right) \end{equation*}

Here expected_runlength is a parameter describing the a-priori best guess of the run length. The larger expected_runlength is, the stronger the evidence in the data must be to support a high changepoint probability.
Step 3: Calculate the probability of change, i.e. $r=0$:

\begin{equation*} P_{t}(runlength\equiv 0)= \sum_{r_{prev}}\left[P_{t-1}(runlength\equiv r_{prev})\ast L(0)\ast \left(\frac{1}{expected\_runlength}\right)\right] \end{equation*}

Step 4: Normalize the probabilities: for all run length probabilities at time $t$, divide by the sum of all run length probabilities.

\begin{equation*} P_{t}(runlength\equiv r_{i})=\frac{P_{t}(runlength\equiv r_{i})}{\sum_{r}P_{t}(runlength\equiv r)} \end{equation*}

For each incoming point, this process is repeated.
This per-point update is why the method is considered an online learning algorithm.
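For concreteness, here is a minimal NumPy/SciPy sketch of this recursion. It is our own simplified illustration, not GLC's implementation: each run keeps a crude running Gaussian estimate of its data (the full algorithm uses conjugate Bayesian updates), the changepoint term evaluates the new point under each previous run's likelihood as in Adams and MacKay's Algorithm 1, and the function name and prior parameters mu0/sigma0 are our own assumptions.

import numpy as np
from scipy.stats import norm

def bocpd_scores(x, expected_runlength=252, mu0=0.0, sigma0=1.0):
    # Returns P(runlength == 0) at each step, i.e. the changepoint score.
    T = len(x)
    h = 1.0 / expected_runlength        # a-priori changepoint probability
    R = np.zeros(T + 1); R[0] = 1.0     # run length distribution P_t
    n = np.zeros(T + 1)                 # observations seen by each run
    mean = np.full(T + 1, mu0)          # running mean per run length
    M2 = np.zeros(T + 1)                # running sum of squared deviations
    cp_prob = np.zeros(T)

    for t, xt in enumerate(x):
        # Step 1: likelihood of x_t under each run's current Gaussian;
        # runs with fewer than two points fall back to the prior sigma0.
        std = np.where(n[:t + 1] > 1,
                       np.sqrt(M2[:t + 1] / np.maximum(n[:t + 1], 1.0)),
                       sigma0)
        L = norm.pdf(xt, loc=mean[:t + 1], scale=std)

        newR = np.zeros(T + 1)
        # Step 2: growth -- each run survives with probability (1 - h).
        newR[1:t + 2] = R[:t + 1] * L * (1.0 - h)
        # Step 3: change -- probability mass flowing into runlength == 0.
        newR[0] = np.sum(R[:t + 1] * L * h)
        # Step 4: normalize into a proper probability distribution.
        R = newR / newR.sum()
        cp_prob[t] = R[0]

        # Each surviving run absorbs x_t: shift statistics by one (Welford).
        for r in range(t + 1, 0, -1):
            n[r] = n[r - 1] + 1.0
            delta = xt - mean[r - 1]
            mean[r] = mean[r - 1] + delta / n[r]
            M2[r] = M2[r - 1] + delta * (xt - mean[r])
        n[0], mean[0], M2[0] = 0.0, mu0, 0.0

    return cp_prob

# Toy check: a mean shift halfway through should spike the changepoint score.
data = np.concatenate([np.random.normal(0., 1., 100), np.random.normal(5., 1., 100)])
print(bocpd_scores(data, expected_runlength=50)[95:105])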
As described, the algorithm scores each point $x_{t}$ immediately, but if the user can afford to wait for several observations, it is often more accurate to assign lagged changepoint scores. The number of observations to wait before scoring a point is set with the lag parameter.
In [1]:
import graphlab as gl
import matplotlib.pyplot as plt
In [2]:
fred_dcoilbrenteu = gl.SFrame.read_csv('./FRED-DCOILBRENTEU.csv')
In [3]:
fred_dcoilbrenteu
Out[3]:
Next we transform the DATE column into an appropriate timestamp format, and convert the fred_dcoilbrenteu SFrame into a TimeSeries object.
In [4]:
import dateutil.parser

# Parse the DATE strings into Python datetime objects and use them
# as the index of a TimeSeries.
def _parse_date(x):
    return dateutil.parser.parse(x)

fred_dcoilbrenteu['DATE'] = fred_dcoilbrenteu['DATE'].apply(_parse_date)
fred_dcoilbrenteu = gl.TimeSeries(fred_dcoilbrenteu, index='DATE')
fred_dcoilbrenteu
Out[4]:
We can plot the fred_dcoilbrenteu time series as follows.
In [6]:
%matplotlib inline
def plot_time_series(timestamp, values, title, **kwargs):
    plt.rcParams['figure.figsize'] = 14, 7
    plt.plot_date(timestamp, values, fmt='g-', tz='utc', **kwargs)
    plt.title(title)
    plt.xlabel('Year')
    plt.ylabel('Dollars per Barrel')
    plt.rcParams.update({'font.size': 16})

plot_time_series(fred_dcoilbrenteu['DATE'], fred_dcoilbrenteu['VALUE'],
                 'Crude Oil Prices: Brent - Europe [FRED/DCOILBRENTEU]')
In [7]:
window_size = 252  # average number of trading days per year

model_moving_zscore = gl.anomaly_detection.moving_zscore.create(
    fred_dcoilbrenteu, window_size, feature='VALUE')
The primary output of the Moving Z-score model is the scores field. This TimeSeries object contains:

row id/time: ID of the corresponding row in the input dataset. Here the dataset is a TimeSeries object, so the model returns the DATE timestamp. If it were an SFrame, this column would contain the row numbers of the input data.

anomaly_score: the absolute value of the moving Z-score. A score of 0 indicates that the value is identical to the moving average. The higher the score, the more likely a point is to be an anomaly.

VALUE: the recorded price of "Crude Oil Brent - Europe", in Dollars per Barrel.

model update time: time that the model was updated. This is particularly useful for model updating.
In [8]:
scores = model_moving_zscore.scores.to_sframe()
scores.print_rows(num_rows=10, max_row_width=100)
In [9]:
scores[252-10:252+10].print_rows(num_rows=60, max_row_width=100)
Of course, the first 252 rows of the scores output don't have a moving average or Z-score. This is because the moving window does not have sufficient data for those observations.
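As a quick sanity check we can count those missing scores directly (this assumes SArray's num_missing method; check the GLC API docs):

# All anomaly scores in the first window_size rows should be missing.
print(scores.head(window_size)['anomaly_score'].num_missing())  # expect 252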
To reveal, let's say, the 30 most anomalous data points, we can sort the scores SFrame as follows.
In [10]:
scores.sort('anomaly_score', ascending=False).print_rows(num_rows=30, max_row_width=100)
Of course, many more anomalous observations may exist in the fred_dcoilbrenteu time series. A good way to make a final decision on that is to look at the approximate distribution of the anomaly scores with the SArray.sketch_summary() tool, and then get a threshold for the anomaly score with the sketch summary's quantile method. Here we declare the top 1% of the data to be anomalies, thereby characterizing 71 data points as "anomalous".
In [11]:
sketch = scores['anomaly_score'].sketch_summary()
threshold = sketch.quantile(0.99)
anomalies = scores[scores['anomaly_score'] > threshold]
anomalies.print_rows(num_rows=30, max_row_width=100)
In the figure below, we plot the original FRED/DCOILBRENTEU time series of "Dollars per Barrel of Crude Oil Brent - Europe", its moving average across the years, and the data points that we found to be anomalous.
In [12]:
%matplotlib inline
plot_time_series(fred_dcoilbrenteu['DATE'], fred_dcoilbrenteu['VALUE'],\
'Crude Oil Prices: Brent - Europe [FRED/DCOILBRENTEU]', label='FRED/DCOILBRENTEU')
plt.plot_date(scores['DATE'], scores['moving_average'], fmt='b-', tz='utc', lw=2, label='Moving Average')
plt.plot(anomalies['DATE'], anomalies['VALUE'], 'rx', markersize=12, markeredgewidth=1.3, label='Anomalies')
plt.legend(loc='upper left', prop={'size': 16})
plt.show()
In [13]:
model_bayesian_changepoints = gl.anomaly_detection.bayesian_changepoints.create(
    fred_dcoilbrenteu,
    feature='VALUE',
    expected_runlength=252,  # average number of trading days per year
    lag=63)                  # average number of trading days per fiscal quarter
The primary output of the Bayesian Changepoints model is likewise the scores field. This TimeSeries object contains:

row id/time: ID of the corresponding row in the input dataset. Here the dataset is a TimeSeries object, so the model returns the DATE timestamp. If it were an SFrame, this column would contain the row numbers of the input data.

changepoint_score: the probability that the given point is a changepoint. This value lies between 0 and 1.

VALUE: the recorded price of "Crude Oil Brent - Europe", in Dollars per Barrel.

model update time: time that the model was updated. This is particularly useful for model updating.
In [14]:
scores2 = model_bayesian_changepoints.scores.to_sframe()
scores2.print_rows(num_rows=10, max_row_width=100)
To reveal, let's say, the 30 most anomalous data points, we can sort the scores2 SFrame as follows.
In [15]:
scores2.sort('changepoint_score', ascending=False).print_rows(num_rows=30, max_row_width=100)
One interesting thing: if you look at the tail of the scores, you will see a handful of missing values. These data points have insufficient data after them to compute lagged changepoint scores. The number of missing values in the tail of the dataset can be reduced by reducing the lag parameter of our learning model accordingly; however, the returned results will be less accurate. Alternatively, one can choose to update the model with new data, as sketched after the next cell.
In [16]:
scores2.tail(80).print_rows(num_rows=80, max_row_width=100)
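Here is what such an update might look like. This is a hypothetical sketch assuming the model's update method (check the GLC API docs); recent is a stand-in for a batch of genuinely new observations with the same schema (DATE, VALUE):

# Pretend the last fiscal quarter of the series is newly arrived data.
recent = gl.TimeSeries(fred_dcoilbrenteu.to_sframe().tail(63), index='DATE')

# `update` is assumed to return a new model whose scores resolve the
# lagged, previously missing changepoint scores at the old data's tail.
updated_model = model_bayesian_changepoints.update(recent)
updated_model.scores.to_sframe().tail(5).print_rows(max_row_width=100)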
Of course, many more anomalous observations may exist in the fred_dcoilbrenteu time series. A good way to make a final decision on that is to look at the approximate distribution of the changepoint scores with the SArray.sketch_summary() tool, and then get a threshold for the changepoint score with the sketch summary's quantile method. Again, we declare the top 1% of the data to be anomalies, thereby characterizing 75 data points as "anomalous".
In [17]:
sketch2 = scores2['changepoint_score'].sketch_summary()
threshold2 = sketch2.quantile(0.99)
changepoints = scores2[scores2['changepoint_score'] > threshold2]
changepoints.print_rows(num_rows=105, max_row_width=100)
In the figure below, we plot the original FRED/DCOILBRENTEU time series of "Dollars per Barrel of Crude Oil Brent - Europe", its moving average across the years, and the data points that we found to be anomalous with both the Moving Z-Score and the Bayesian Changepoints model.
In [18]:
%matplotlib inline
plt.rcParams['figure.figsize'] = 14, 24
plt.figure(1)
plt.subplot(3,1,1)
plt.plot_date(fred_dcoilbrenteu['DATE'], fred_dcoilbrenteu['VALUE'],\
fmt='g-', tz='utc', label='FRED/DCOILBRENTEU')
plt.plot_date(scores['DATE'], scores['moving_average'],\
fmt='b-', tz='utc', lw=2, label='Moving Average')
plt.xlabel('Year')
plt.ylabel('Dollars per Barrel')
plt.title('Crude Oil Prices: Brent - Europe [FRED/DCOILBRENTEU]')
plt.rcParams.update({'font.size': 16})
plt.plot(anomalies['DATE'], anomalies['VALUE'],\
'bx', markersize=12, markeredgewidth=1.3, label='Anomalies [Moving Z-Score]')
plt.legend(loc='upper left', prop={'size': 16})
plt.subplot(3,1,2)
plt.plot_date(fred_dcoilbrenteu['DATE'], fred_dcoilbrenteu['VALUE'],\
fmt='g-', tz='utc', label='FRED/DCOILBRENTEU')
plt.plot_date(scores['DATE'], scores['moving_average'],\
fmt='b-', tz='utc', lw=2, label='Moving Average')
plt.xlabel('Year')
plt.ylabel('Dollars per Barrel')
plt.title('Crude Oil Prices: Brent - Europe [FRED/DCOILBRENTEU]')
plt.rcParams.update({'font.size': 16})
plt.plot(changepoints['DATE'], changepoints['VALUE'],\
'rx', markersize=12, markeredgewidth=1.3, label='Anomalies [Bayesian Changepoints]')
plt.legend(loc='upper left', prop={'size': 16})
plt.subplot(3,1,3)
plt.plot_date(scores2['DATE'], scores2['changepoint_score'],\
fmt='r-', tz='utc', lw=2, label='Bayesian Changepoint Probability')
plt.rcParams.update({'font.size': 16})
plt.xlabel('Year')
plt.ylabel('Changepoint Probability')
plt.title('Crude Oil Prices: Brent - Europe [FRED/DCOILBRENTEU]')
plt.legend(loc='upper left', prop={'size': 16})
plt.show()
Interestingly enough, the Bayesian Changepoints model revealed some new anomalous points that were missed by the Moving Z-Score algorithm. More specifically, new anomalous data points are observed during the periods 1998-2000 and 2006-2010. More importantly, the steep decline of Crude Oil Prices (Brent - Europe) during the recent Great Recession (2007-2008) is now flagged as an additional anomaly, which is of course correct.